Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource
Word embeddings have recently seen a strong increase in interest as a result
of strong performance gains on a variety of tasks. However, most of this
research also underlined the importance of benchmark datasets, and the
difficulty of constructing these for a variety of language-specific tasks.
Still, many of the datasets used in these tasks could prove to be fruitful
linguistic resources, allowing for unique observations into language use and
variability. In this paper we demonstrate the performance of multiple types of
embeddings, created with both count and prediction-based architectures on a
variety of corpora, in two language-specific tasks: relation evaluation, and
dialect identification. For the latter, we compare unsupervised methods with a
traditional, hand-crafted dictionary. With this research, we provide the
embeddings themselves, the relation evaluation task benchmark for use in
further research, and demonstrate how the benchmarked embeddings prove a useful
unsupervised linguistic resource, effectively used in a downstream task.
Comment: in LREC 201
NeuTral Rewriter: A Rule-Based and Neural Approach to Automatic Rewriting into Gender-Neutral Alternatives
Recent years have seen an increasing need for gender-neutral and inclusive language. Within the field of NLP, there are various mono- and bilingual use cases where gender-inclusive language is appropriate, if not preferred due to ambiguity or uncertainty in terms of the gender of referents. In this work, we present a rule-based and a neural approach to gender-neutral rewriting for English along with manually curated synthetic data (WinoBias+) and natural data (OpenSubtitles and Reddit) benchmarks. A detailed manual and automatic evaluation highlights how our NeuTral Rewriter, trained on data generated by the rule-based approach, obtains word error rates (WER) below 0.18% on synthetic, in-domain and out-domain test sets.
Adversarial Stylometry in the Wild: Transferable Lexical Substitution Attacks on Author Profiling
Written language contains stylistic cues that can be exploited to
automatically infer a variety of potentially sensitive author information.
Adversarial stylometry intends to attack such models by rewriting an author's
text. Our research proposes several components to facilitate deployment of
these adversarial attacks in the wild, where neither data nor target models are
accessible. We introduce a transformer-based extension of a lexical replacement
attack, and show it achieves high transferability when trained on a weakly
labeled corpus -- decreasing target model performance below chance. While not
completely inconspicuous, our more successful attacks also prove notably less
detectable by humans. Our framework therefore provides a promising direction
for future privacy-preserving adversarial attacks.
Comment: Accepted to EACL 202
Style Obfuscation by Invariance
The task of obfuscating writing style using sequence models has previously
been investigated under the framework of obfuscation-by-transfer, where the
input text is explicitly rewritten in another style. These approaches also
often lead to major alterations to the semantic content of the input. In this
work, we propose obfuscation-by-invariance, and investigate to what extent
models trained to be explicitly style-invariant preserve semantics. We evaluate
our architectures on parallel and non-parallel corpora, and compare automatic
and human evaluations on the obfuscated sentences. Our experiments show that
style classifier performance can be reduced to chance level, whilst the
automatic evaluation of the output is seemingly equal to models applying
style-transfer. However, based on human evaluation we demonstrate a trade-off
between the level of obfuscation and the observed quality of the output in
terms of meaning preservation and grammaticality.
Comment: Accepted for presentation at COLING1
Native Language Identification with Big Bird Embeddings
Native Language Identification (NLI) intends to classify an author's native
language based on their writing in another language. Historically, the task has
heavily relied on time-consuming linguistic feature engineering, and
transformer-based NLI models have thus far failed to offer effective, practical
alternatives. The current work investigates if input size is a limiting factor,
and shows that classifiers trained using Big Bird embeddings outperform
linguistic feature engineering models by a large margin on the Reddit-L2
dataset. Additionally, we provide further insight into input length
dependencies, show consistent out-of-sample performance, and qualitatively
analyze the embedding space. Given the effectiveness and computational
efficiency of this method, we believe it offers a promising avenue for future
NLI work.
Neural Data-to-Text Generation Based on Small Datasets: Comparing the Added Value of Two Semi-Supervised Learning Approaches on Top of a Large Language Model
This study discusses the effect of semi-supervised learning in combination
with pretrained language models for data-to-text generation. It is not known
whether semi-supervised learning is still helpful when a large-scale language
model is also supplemented. This study aims to answer this question by
comparing a data-to-text system only supplemented with a language model, to two
data-to-text systems that are additionally enriched by a data augmentation or a
pseudo-labeling semi-supervised learning approach.
Results show that semi-supervised learning results in higher scores on
diversity metrics. In terms of output quality, extending the training set of a
data-to-text system with a language model using the pseudo-labeling approach
did increase text quality scores, but the data augmentation approach yielded
similar scores to the system without training set extension. These results
indicate that semi-supervised learning approaches can bolster output quality
and diversity, even when a language model is also present.
Comment: 22 pages (excluding bibliography and appendix
Tailoring Domain Adaptation for Machine Translation Quality Estimation
While quality estimation (QE) can play an important role in the translation
process, its effectiveness relies on the availability and quality of training
data. For QE in particular, high-quality labeled data is often lacking due to
the high cost and effort associated with labeling such data. Aside from the
data scarcity challenge, QE models should also be generalizable, i.e., they
should be able to handle data from different domains, both generic and
specific. To alleviate these two main issues -- data scarcity and domain
mismatch -- this paper combines domain adaptation and data augmentation within
a robust QE system. Our method is to first train a generic QE model and then
fine-tune it on a specific domain while retaining generic knowledge. Our
results show a significant improvement for all the language pairs investigated,
better cross-lingual inference, and a superior performance in zero-shot
learning scenarios as compared to state-of-the-art baselines.
Comment: Accepted to EAMT 2023 (main